ABSTRACT

Computerized speech could enhance the effectiveness 
of computer-assisted instruction as an educational tool. Digital 
audio under computer control allows a very wide range of replies, but 
it poses special problems in the areas of listener attitudes and 
speaker intelligibility. This paper discusses the design and 
implementation of special tests to discover a speaker who would be 
most pleasing and intelligible to students using random access
digital audio in a computer-assisted instruction system. Auditions
were for both amateur and professional speakers, male and female. 
Junior college students rated the voices for likeability and 
intelligibility. Those who scored highest in the two tests all had 
some professional voice training and spoke in a mid-range pitch. As 
was expected, there was a correlation between intelligibility and
attitude. Appendices contain raw scores and illustrative figures. (JY)
The spoken word is an integral part of a child's education, and
computerized speech could enhance the effectiveness of
computer-assisted instruction as an educational tool. Conventional analog
tape recording methods do not readily permit random access of 
numerous replies to cover a wide range of learning situations. 
Digital audio under computer control allows a very wide range of 
replies, but it poses special problems in the areas of listener 
attitude and speaker intelligibility. This paper will discuss the 
design and implementation of special tests to discover a speaker
who would be most pleasing and most intelligible to students using
random access digital audio in our computer-assisted instruction
system.




Let us begin with an examination of the basic difference between 
analog and digital audio. Figure 1 shows one of the many methods 
we have to store sounds: in this case, by musical notation. The
listener, a trained musician, converts the musical tones that he
hears to musical notes which he records on paper. In this written
form the music can be stored indefinitely, and it can be reproduced
as music at any time by another trained musician.



Another storage system, and the most efficient way to store sounds for
computer control, is to convert the sound's analog signal into a digital
format for computer processing, as shown in Figure 2. The digital
format permits ease of access and control for the audio information,
and it also permits storage on a standard computer disc unit.



For those of you who are not familiar with a computer disc unit, 
one is shown in Figure 3. Note the similarity to record discs. 

These discs are coated with a magnetic recording substance which 
may be reached by the movable heads shown to your left. The
important point here is that there are 2000 recording
tracks on such a unit, and any track can be reached in less than
one-tenth of a second. Digital audio stored on these tracks may 
be accessed quickly to compose sentences for playback as shown in 
Figure 4. 

Although intelligible speech has been synthesized by various methods, 
the artificial speech quality has been judged to be a possible source 
of interference with the learning process at this stage of synthesized 
speech development. Thus we have chosen to operate at the word 
level, with sentences constructed from whole words that have
previously been stored on a computer disc unit. This would be
approximately the same as recording several thousand words on small
lengths of recording tape, and then composing a message by splicing
the proper pieces of tape. The computer performs the task at the rate
of approximately 40 words per second, and this permits the composition
of messages for more than one user at a time.
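
The word-level splicing described above can be sketched in a few lines.
This is only an illustration of the idea, not our audio delivery
program: the function name compose_message, the in-memory dictionary,
and the byte strings standing in for stored word recordings are all
assumptions made for the example.

```python
from typing import Dict, List


def compose_message(words: List[str], dictionary: Dict[str, bytes]) -> bytes:
    """Splice pre-recorded word clips together, in order, into one message."""
    # Refuse to build a message if any word was never recorded.
    missing = [w for w in words if w not in dictionary]
    if missing:
        raise KeyError(f"words not in the stored dictionary: {missing}")
    # Concatenation plays the role of splicing pieces of tape end to end.
    return b"".join(dictionary[w] for w in words)


# Hypothetical stand-in clips; real entries would be digitized audio.
dictionary = {"the": b"\x01\x02", "cat": b"\x03", "sat": b"\x04\x05"}
message = compose_message(["the", "cat", "sat"], dictionary)
```

Because each word is stored whole and looked up independently, the
same dictionary can serve several listeners at once, each receiving a
different sequence of words.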

The tape splicing or computer splicing of words to form sentences 
leads to the first problem in the area of learning. The message 
must be understandable, and yet it is being composed of words spoken 
out of context. The speaker who is chosen for such a digital audio 
system must be able to pronounce the words in such a way as to 
minimize the contextual conflicts in pronunciation while at the same 
time achieving a high rate of intelligibility. In this case
intelligibility is the prime factor, with attitude playing a major
supporting role.

The ability to achieve a high rate of intelligibility while
minimizing the contextual problem of pronunciation might not be
restricted to professional announcers. Our auditions included both
amateur and professional speakers, with approximately an equal
number of males and females. Each speaker read a list of
monosyllables chosen at random from the Harvard monosyllable lists,
and also read sentences designed to cover the normal range
of pronunciation problems.

The time and effort required to run intelligibility tests led us
to run the attitude tests first, and then to
measure the intelligibility levels of the top seven speakers. The
test design is a balanced incomplete factorial design as shown in 
Figure 5. In this test, every speaker is compared to every other 
speaker twice to permit each speaker to have the first position in 
a binary comparison. The test is divided into many subsections 
in which the listeners hear one speaker and then another. The 
listeners are then asked to indicate their preference for speaker A, 
speaker B, or neither speaker. There are 342 speaker comparisons, 
and each test group (there are six groups) is asked to rate
one-sixth of the comparisons, or 57.
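
The counts above follow directly from the design: 19 speakers, each
paired with the other 18 in both orders, give 19 x 18 = 342 ordered
comparisons, and 342 / 6 = 57 per group. The sketch below checks that
arithmetic. The round-robin dealing of comparisons to groups is an
assumption made for illustration only; the actual balanced assignment
is the one given in Figure 5.

```python
from itertools import permutations
import string

# Speakers are labelled A to S (19 speakers), as in Figure 5.
speakers = list(string.ascii_uppercase[:19])

# Ordered pairs: (first speaker heard, second speaker heard).
# Every speaker meets every other speaker twice, once in each position.
comparisons = list(permutations(speakers, 2))

NUM_GROUPS = 6
# Assumed dealing rule (not the Figure 5 assignment): comparison i goes
# to group i mod 6, so every group receives the same number.
groups = [comparisons[g::NUM_GROUPS] for g in range(NUM_GROUPS)]

print(len(comparisons), [len(g) for g in groups])
```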

Each comparison consists of one speaker saying three words, and then
another speaker saying the same three words. To eliminate listener
fatigue, there are ten words in a list, and each comparison moves to
the next three words on the list. Thus the speakers and words are
constantly changing. To produce the type of test I have just described
by conventional tape splicing or dubbing methods would be a
considerable effort. The audio delivery program was modified to have
the computer select the six words for each pair of speaker comparisons,
and the test tapes were produced under computer control in less than
two hours. Note that the computer not only selected the word pairs,
it also played the audio comparisons. Then a regular tape recorder
was used to record the audio test generated by the computer. Here is
a sample of the comparison tapes; all nineteen voices are included
in the sample. (Play audio tape segment one).
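
One way to picture the word selection is as a window of three words
that advances around the ten-word list from one comparison to the
next. The sketch below is a reading of the description above, not the
modified audio delivery program itself: the wrap-around at the end of
the list, the function names, and the placeholder words are all
assumptions.

```python
# Placeholder words; the actual ten test words are not listed here.
WORD_LIST = [f"word{i}" for i in range(1, 11)]


def words_for_comparison(k, word_list=WORD_LIST, per_comparison=3):
    """Return the three words used in the k-th comparison (k = 0, 1, ...).

    Assumed rule: each comparison starts three words further along the
    list and wraps around at the end.
    """
    start = (k * per_comparison) % len(word_list)
    return [word_list[(start + j) % len(word_list)]
            for j in range(per_comparison)]


def playback_order(first, second, k):
    """Six utterances: three words by the first speaker, then the same
    three words by the second speaker."""
    words = words_for_comparison(k)
    return [(first, w) for w in words] + [(second, w) for w in words]


print(playback_order("A", "B", 0))
```

Under this assumed rule the starting word cycles through all ten
positions in the list before repeating, since three and ten share no
common factor.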
The seven finalists with the highest scores in the attitude test
were allowed to read the intelligibility tests, which are constructed
from six standard intelligibility tests as specified by the Acoustical
Society of America.¹ Each test contains 50 monosyllabic words, and
each word is spoken in the statement "Would you write ______ now?" read
as a simple declarative sentence. In this case the computer was not
used; rather, a delta modulation simulator was used to provide the
equivalent audio output for the intelligibility tests. The computer
could have been employed to generate the tests, but the linear nature
of the material permitted a straightforward recording approach. Here
are recorded samples of the seven speakers who participated in the
intelligibility tests. (Play audio tape segment two).
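
For listeners unfamiliar with delta modulation, the general technique
can be sketched as follows. This is a generic one-bit, fixed-step
delta modulator and is not a description of the simulator used for
these tests; the step size, the sine-wave test signal, and the
function names are assumptions chosen only to make the example run.

```python
import math


def delta_encode(samples, step):
    """Encode samples as one bit each: 1 = step up, 0 = step down."""
    bits = []
    estimate = 0.0
    for s in samples:
        # Compare the input with the running estimate and move toward it.
        bit = 1 if s >= estimate else 0
        estimate += step if bit else -step
        bits.append(bit)
    return bits


def delta_decode(bits, step):
    """Rebuild the staircase approximation from the bit stream."""
    out = []
    estimate = 0.0
    for bit in bits:
        estimate += step if bit else -step
        out.append(estimate)
    return out


# Assumed test signal: two cycles of a sine wave, 64 samples per cycle.
samples = [math.sin(2 * math.pi * i / 64) for i in range(128)]
bits = delta_encode(samples, step=0.1)
reconstructed = delta_decode(bits, step=0.1)
```

The decoder repeats the encoder's running estimate, so only one bit
per sample needs to be stored. A larger step follows fast changes
better but gives a coarser staircase on slowly varying passages.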

The tests were administered to the six listener groups over a
two-day interval in the same room with the same playback
configuration. The listeners wore stereo earphones which were
connected in a monaural mode. Foothill Junior College students were
paid for their participation, and they were selected on the basis of
their willingness to participate. Any hearing defect automatically
disqualified a potential test subject.

The tests went well. The students were generally eager to participate, 
and they definitely had opinions about the speakers, as the test 
results show. The test design had been pretested on a group of 
randomly selected RCA employees, and this helped to eliminate any 
potential confusion in the real tests. At least two persons were 
present to supervise each group of six students, and ensure that no 
horseplay or confusion arose. 

The tests were graded by two independent groups to ensure accuracy. 

The attitude scores are shown in Figure 6. The adjusted score is 
obtained by adding two points for each win and one point for each 
tie. The top two scores have a considerable margin over the next 
six scores which are in the 220-230 range. Also, note that the top 
score is greater than three times the smallest score. 
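
The adjusted score can be written out as a small tally. The sketch
below assumes that each listener response counts as one win for the
preferred speaker, and that a response of "neither" counts as a tie
for both speakers in the comparison; that interpretation, the ballot
format, and the sample ballots are assumptions for illustration and
are not the data of Figure 6.

```python
def tally_scores(ballots):
    """Return {speaker: adjusted score}, two points per win, one per tie.

    Each ballot is (first_speaker, second_speaker, choice), where choice
    is "first", "second", or "neither".
    """
    scores = {}
    for first, second, choice in ballots:
        # Make sure both speakers appear even if they never win.
        scores.setdefault(first, 0)
        scores.setdefault(second, 0)
        if choice == "first":
            scores[first] += 2
        elif choice == "second":
            scores[second] += 2
        elif choice == "neither":
            scores[first] += 1
            scores[second] += 1
        else:
            raise ValueError(f"unknown choice: {choice!r}")
    return scores


# Hypothetical ballots, for illustration only.
ballots = [("A", "B", "first"), ("B", "A", "neither"), ("A", "C", "second")]
scores = tally_scores(ballots)
```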

The intelligibility scores are shown in Figure 7. Although the same
speaker scored highest in both test phases, there is a change in the 
second highest position. Speaker O, a commercial radio announcer, 
has an 88% intelligibility score, although he is more than 40 points 
lower in attitude than speaker F. 

The four highest scoring speakers had some form of professional speech
training, and one is a commercial radio announcer in San Francisco.

In general, the female voices tend to be low in pitch while the male
voices tend to be high among the high scorers. This would suggest
that a mid-range pitch might be best for our digital audio system.

¹American Standard Method for Measurement of Monosyllabic Word
Intelligibility, Sponsored by the Acoustical Society of America.
Approved May 25, 1960.

Note the consistency in the attitude and intelligibility scores. 

There may be an interaction at work here as a high intelligibility 
score may produce a high attitude rank. One important feature of 
liking a voice should be understanding the voice. 

The highest scoring voice was used to produce a working dictionary
of approximately 600 words to be used for a digital audio system
as part of a computer-assisted instruction system. Here is some
computer output. (Play audio tape segment three). Although it will
probably never be possible to reproduce perfectly natural speech
from words spoken out of context, the sample you have just heard is
well over 90% intelligible when played over earphones in our
installation.

Future studies should be performed to determine the type of voice 
best suited to a learning situation, or if many voices will serve 
in this application. The listener fatigue effect should be studied 
to see if digital audio becomes more or less pleasant with time. 

And in all of these studies it should be possible to use the computer 
to generate many tests in a fraction of the time necessary with analog 
recording techniques. The quality of digital audio is a function of 
the storage space required on the disc unit. If fewer words are 
stored, the quality of the digital audio system can be greatly 
enhanced while the advantages of computer processing are retained. 

Further research in synthesized speech may permit us to generate 
thousands of words from some type of basic speech units. In the 
meantime we are striving to produce the best possible word-oriented
system to be used in industrial and computer-assisted instruction
applications.
FIGURE 5 - BALANCED INCOMPLETE

A TO S = SPEAKERS 1 TO 19
1 TO 10 = STARTING WORD OF THREE WORD PAIRS